Exploring Simpson’s Paradox in the Penguin Dataset
Nice things
I learn new Quarto things…
Author: me
Affiliation: University
Published: January 28, 2025
Keywords: Quarto, Paradox, Data Analysis
A few considerations about this document
This Quarto document serves as a practical illustration of the concepts covered in the Productive R Workflow online course. It is designed primarily for educational purposes, so the focus is on demonstrating Quarto techniques rather than on the rigor of its scientific content.
1 Introduction
This document offers a straightforward analysis of the well-known penguin dataset. It is designed to complement the Productive R Workflow online course.
data <- data %>%
  mutate(
    bill_depth_mm = as.numeric(bill_depth_mm) # Convert to numeric
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `bill_depth_mm = as.numeric(bill_depth_mm)`.
Caused by warning:
! NAs introduced by coercion
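The warning above means that at least one entry of `bill_depth_mm` could not be parsed as a number and was replaced by `NA`. One way to find the offending entries is to compare the raw values with the converted ones. The sketch below uses a made-up vector (`raw` is purely illustrative and not taken from the dataset):

```r
# Hypothetical raw values, for illustration only (not from the penguin data)
raw <- c("18.7", "17.4", "unknown", NA)

# Convert to numeric; the coercion warning is silenced because we inspect
# the failures explicitly below
converted <- suppressWarnings(as.numeric(raw))

# An entry failed to convert if it was present in the raw input
# but is NA after conversion
failed <- !is.na(raw) & is.na(converted)

raw[failed]
```

Entries that were already missing in the raw input are not flagged, so only genuine parsing failures are reported.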
3 Bill Length and Bill Depth
Now, let’s run some descriptive analysis, including summary statistics and graphs.
What’s striking is the slightly negative relationship between bill length and bill depth: penguins with longer bills tend to have slightly shallower ones. One would expect the opposite.
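# Scatterplot of bill depth against bill length, all species pooled
p <- data %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = "#69b3a2") +
  labs(
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    title = paste("Surprising relationship?")
  ) +
  my_theme()

# Turn the static plot into an interactive one
ggplotly(p)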
Relationship between bill length and bill depth. All data points included.
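To put a number on the trend in the figure, one option is Pearson’s correlation coefficient. This is a minimal sketch, assuming the data match the `penguins` data frame from the palmerpenguins package (an assumption, since the document’s own `data` object is loaded elsewhere):

```r
library(palmerpenguins)  # assumed data source, not confirmed by this document

# Pearson correlation between bill length and bill depth, all species pooled;
# use = "complete.obs" drops rows where either measurement is missing
overall_cor <- cor(
  penguins$bill_length_mm,
  penguins$bill_depth_mm,
  use = "complete.obs"
)

overall_cor
```

A value below zero is consistent with the downward trend in the scatterplot.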
It is also interesting to note that bill length and bill depth differ considerably from one species to another. The average of a variable is the sum of its observed values divided by the number of observations: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
Bill length and bill depth averages are summarized in the two tables below.
#| layout-ncol: 2

# Calculate the average bill length per species
bill_length_per_specie <- data %>%
  group_by(species) %>%
  summarise(
    average_bill_length = mean(bill_length_mm, na.rm = TRUE)
  )

# bill_length_per_specie

# Display the bill length table
kable(bill_length_per_specie)
species      average_bill_length
Adelie       38.80872
Chinstrap    48.83382
Gentoo       47.50488
# Calculate the average bill depth per species
bill_depth_per_specie <- data %>%
  group_by(species) %>%
  summarise(
    average_bill_depth = mean(bill_depth_mm, na.rm = TRUE)
  )

# bill_depth_per_specie

# Display the bill depth table
kable(bill_depth_per_specie)
species      average_bill_depth
Adelie       18.34228
Chinstrap    18.42059
Gentoo       14.98211
# Extract and round the average bill length for the Adelie species
bill_length_adelie <- bill_length_per_specie %>%
  filter(species == "Adelie") %>%
  pull(average_bill_length) %>%
  round(2)
For instance, the average bill length for the Adelie species is 38.81 mm.
Now, let’s check the relationship between bill depth and bill length within each species, one panel per species:
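# Use the function in functions.R
p1 <- create_scatterplot(data, "Adelie", "#6689c6")
p1 <- p1 + my_theme()

p2 <- create_scatterplot(data, "Chinstrap", "#e85252")
p2 <- p2 + my_theme()

p3 <- create_scatterplot(data, "Gentoo", "#9a6fb0")
p3 <- p3 + my_theme()

# Combine the three plots into one figure
# (adding plots with `+` needs a composition package such as patchwork)
p1 + p2 + p3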
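The `create_scatterplot()` helper is defined in functions.R and is not shown here. For readers without that file, the sketch below is a hypothetical stand-in that follows the same call pattern (data, species name, point color). The name `create_scatterplot_sketch`, its body, and the use of the palmerpenguins `penguins` data frame are all assumptions, and the real helper may differ:

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)  # assumed data source, not confirmed by this document

# Hypothetical stand-in for create_scatterplot(); the real helper lives in
# functions.R and may be implemented differently
create_scatterplot_sketch <- function(data, selected_species, color) {
  data %>%
    # Keep only the rows for the requested species
    filter(species == selected_species) %>%
    ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
    geom_point(color = color) +
    labs(
      x = "Bill Length (mm)",
      y = "Bill Depth (mm)",
      title = selected_species
    )
}

# Build the Adelie panel with the same color used in the document
p_adelie <- create_scatterplot_sketch(penguins, "Adelie", "#6689c6")
```

Taking the species and color as arguments keeps the three panels consistent while avoiding three copies of the same plotting code.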
There is actually a positive correlation when the data are split by species, even though the pooled data show a negative one. This reversal is Simpson’s paradox.
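The per-species pattern can also be checked numerically by computing the correlation within each group. This is a minimal sketch, again assuming the data match the `penguins` data frame from the palmerpenguins package (an assumption, not something this document confirms):

```r
library(dplyr)
library(palmerpenguins)  # assumed data source, not confirmed by this document

# Pearson correlation between bill length and bill depth within each species;
# use = "complete.obs" drops rows where either measurement is missing
cor_per_species <- penguins %>%
  group_by(species) %>%
  summarise(
    correlation = cor(bill_length_mm, bill_depth_mm, use = "complete.obs")
  )

cor_per_species
```

Each species gets its own coefficient, so the sign within groups can be compared directly with the sign obtained on the pooled data.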